2016 may be long past, but Beyoncé’s vulnerable, emotional, and powerfully feminist album “Lemonade” will inarguably always be relevant. The album is accompanied by an hour-long film broken into twelve songs and eleven “chapters,” all weaving together a narrative of heartbreak, betrayal, loss, black female empowerment and, ultimately, hope. The following text analysis dives deep into each song’s lyrics, chapter by chapter. Lyrics were found on Genius.com and transferred to text files for analysis. For the purposes of this analysis, the twelfth song, “Formation,” was left out because it is not assigned to a “chapter.”
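# packages used throughout the analysis
library(tm)
library(tidyverse)
library(tidytext)
library(stargazer)
library(corrplot)
library(wordcloud)
library(RColorBrewer)
txts <- VCorpus(DirSource(directory = "/Users/sophiebeiers/Documents/GitHub/Lemonade/data/txts"))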
txts.df <- read_csv("../data/chps.csv")
meta(txts, type="local", tag="title") <- txts.df$song_title
meta(txts, type="local", tag="chapter") <- txts.df$chapter
exceptions <- c("i", "we", "we're", "we've", "we'd", "us", "they", "them", "they've", "they're", "he", "she", "she'll", "she'd", "she's", "you", "you're", "you've", "your","him", "my", "mine", "her", "hers", "his", "him", "sorry")
exceptions2 <- c("sorry", "said")
my_stopwords <- setdiff(stopwords("en"), exceptions)
reg_stopwords <- setdiff(stopwords("en"), exceptions2)
# prepping data
corpus <- tm_map(txts, content_transformer(tolower))
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removeWords, my_stopwords)
# document term matrix
doctm <- DocumentTermMatrix(corpus) # create DocumentTermMatrix
doctm_m <- as.matrix(doctm)
# term document matrix
doc_tdm <- TermDocumentMatrix(corpus)
doc_tdm_m <- as.matrix(doc_tdm)
# tidy object
doc_td <- tidy(doc_tdm)
# merge information
meta <- as_data_frame(str_split_fixed(doc_td$document, "_", n = 3))
colnames(meta) <- c("Order", "Song_title", "Chapter")
doc_td <- as_data_frame(cbind(doc_td, meta))
doc_td$Chapter <- str_sub(doc_td$Chapter, 1, str_length(doc_td$Chapter)-4)
# bag of words
words <- txts.df %>%
unnest_tokens(word, lyrics) %>%
mutate(row = row_number())

To begin, I split each song into a “bag of words” and counted the most common words across the entire album. Unsurprisingly, “love” comes up 81 times; many of the lyrics in Beyoncé’s journey reference her relationship with Jay-Z, positively or negatively. Other common words include “sorry,” likely mainly from the song “Sorry,” in which she explains repeatedly how NOT sorry she is about Jay-Z’s affair, and “feel,” which is certainly fitting given the range of emotions Beyoncé expresses throughout the album.
topwords <- doc_td %>%
#anti_join(stop_words, by= c("term" = "word")) %>%
filter(!(term %in% reg_stopwords)) %>%
group_by(term) %>%
summarise(n = sum(count)) %>%
top_n(n = 20, wt = n) %>%
ungroup() %>%
mutate(term = reorder(term, n))
ggplot(data = topwords, aes(term, n)) +
geom_bar(stat = "identity", fill = "#ffcc00") +
geom_text(aes(label = term, x=term, y = 1), hjust = 0,
size = 4, color = "#b35900") +
geom_text(aes(label= n, x=term, y = n-1), hjust = 1, color= 'white') +
xlab(NULL) +coord_flip() + theme_light() +
theme(axis.title = element_blank(),
axis.text = element_blank(),
axis.ticks = element_blank()) +
labs(x = '', title = "Most Frequent Words in Beyonce's Lemonade",
subtitle = "Number of times word appeared in album")

Next, I split the words up by song to find which lyrics are most frequent in each song. Unsurprisingly, the title of the song is featured in most top 10 lists (e.g. “sorry” in “Sorry,” “pray” in “Pray You Catch Me,” “daddy” in “Daddy Lessons”). From the visualizations, we can also see which songs are more repetitive than others. For instance, the songs “Sorry,” “Hold Up” and “All Night” use their top 10 words repeatedly, whereas songs like “Forward” and “Love Drought” feature fewer words.
# order chapters and songs
doc_td$Song_title[doc_td$Chapter == "Emptiness"] <- "6 Inch"
doc_td$Song_title[doc_td$Chapter == "Anger"] <- "Don't Hurt Yourself"
doc_td$Song_title2 <- paste(doc_td$Order, sep = '. ', doc_td$Song_title)
doc_td$Song_title <- factor(doc_td$Song_title, levels = c("Pray You Catch Me", "Hold Up", "Don't Hurt Yourself", "Sorry", "6 Inch", "Daddy Lessons", "Love Drought", "Sand Castles", "Forward", "Freedom", "All Night"))
doc_td$Song_title2 <- factor(doc_td$Song_title2, levels = c("1. Pray You Catch Me", "2. Hold Up", "3. Don't Hurt Yourself", "4. Sorry", "5. 6 Inch", "6. Daddy Lessons", "7. Love Drought", "8. Sand Castles", "9. Forward", "10. Freedom", "11. All Night"))
doc_td$Chapter <- factor(doc_td$Chapter, levels = c("Intuition", "Denial", "Anger", "Apathy", "Emptiness", "Accountability", "Reformation", "Forgiveness", "Ressurection", "Hope", "Redemption"))
topwords_song <- doc_td %>%
filter(!(term %in% reg_stopwords)) %>%
arrange(desc(count)) %>%
group_by(Song_title2) %>%
slice(1:10) %>%
select(Song_title2, term, count)
ggplot(data = topwords_song, aes(reorder(term, count), count)) +
geom_bar(stat = "identity", fill = "#ffcc00") +
coord_flip() + theme_light() +
theme(axis.title = element_blank(),
axis.text.x = element_blank(),
axis.ticks = element_blank()) +
labs(x = '',
title = "Most Frequent Words in Beyonce's Lemonade",
subtitle = "Number of times word appeared in each song") +
facet_wrap(~Song_title2, scales = "free_y") +
theme(strip.text = element_text(size=8, color = 'black'),
strip.background = element_blank())

I ran a correlation analysis on the lyrics of each song to better understand which songs were more like or unlike each other. From the table, we can see that “Hold Up” and “Love Drought” are the most similar, with a correlation coefficient of 0.696. “Love Drought” and “Pray You Catch Me” hold the second-highest correlation coefficient, 0.588. All three of these songs make a lot of references to “love” and relationships. When listening to the songs, “Pray You Catch Me” and “Love Drought” are more similar in emotion, while “Hold Up” is quite a bit more upbeat than the other two, but the lyrics may not reflect this. Other song lyrics are only weakly correlated (coefficient < 0.3). Funnily enough, most correlations are positive, and the few negative correlations are very weak (e.g. between “Forward” and “6 Inch” or “Sorry” and “6 Inch”). All negative relationships include “6 Inch”; this song is arguably one of the most intense on the album, so it is clearly quite different from the others.
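As a rough illustration of what the correlation table measures, the sketch below builds a tiny, made-up term-document matrix and correlates its columns the same way `cor(doc_tdm_m)` does. The song names and counts are hypothetical and are not taken from the album.

```r
# Toy term-document matrix: rows are terms, columns are hypothetical songs.
# All counts are invented for illustration.
toy_tdm <- matrix(
  c(4, 3, 0,   # "love"
    2, 1, 0,   # "pray"
    0, 0, 5,   # "money"
    0, 1, 3),  # "work"
  nrow = 4, byrow = TRUE,
  dimnames = list(c("love", "pray", "money", "work"),
                  c("song_a", "song_b", "song_c"))
)

# cor() on a matrix correlates every pair of columns (here: every pair of songs)
toy_cor <- cor(toy_tdm)
round(toy_cor, 2)
```

In this toy case, `song_a` and `song_b` lean on the same words and correlate strongly, while `song_c` uses a different vocabulary and correlates negatively with `song_a`.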
colnames(doc_tdm_m) <- c("Pray You Catch Me", "Freedom", "All Night", "Hold Up", "Don't Hurt Yourself", "Sorry", "6 Inch", "Daddy Lessons", "Love Drought", "Sand Castles", "Forward")
stargazer(cor(doc_tdm_m), type = "text")
# store the correlation matrix under its own name so it does not mask cor()
song_cor <- cor(doc_tdm_m)
corrplot(song_cor, type = "upper", tl.col = "black", tl.srt = 45)

“TF-IDF” (term frequency-inverse document frequency) reflects the most important words for each song by decreasing the weight of words that appear in many songs across the album and increasing the weight of words that appear in only a few. Thus, the following words per song are the most unique or important to that song in relation to the word’s appearance in the album as a whole.
We see some interesting themes come up: it’s now clear that “6 Inch” speaks a lot about women (“she”, “her”) and likely about empowering women (“she”, “works”, “money”). The title refers to 6-inch heels, and the song seeks to amplify the fierceness and independent power of women! The title “Sand Castles” may not tell us much, but its most important words, “promise,” “work,” and “couldn’t,” give us some insight into its topic. In “Sand Castles,” Beyoncé reminisces about her flawed but beautiful relationship with Jay-Z. Promises were kept, promises were broken. The EMOTION!
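To make the weighting concrete, here is a minimal base-R sketch of the TF-IDF arithmetic on made-up counts. It assumes the definition used by `bind_tf_idf()` (term frequency as a share of the song's words, multiplied by the natural log of the number of songs divided by the number of songs containing the term); the songs, terms and counts are hypothetical.

```r
# Made-up word counts for three hypothetical songs
counts <- data.frame(
  song = c("a", "a", "b", "b", "c"),
  term = c("love", "heels", "love", "promise", "love"),
  n    = c(3, 1, 2, 2, 4),
  stringsAsFactors = FALSE
)

# term frequency: share of the song's words taken up by this term
counts$tf <- counts$n / ave(counts$n, counts$song, FUN = sum)

# inverse document frequency: log(number of songs / songs containing the term)
n_songs <- length(unique(counts$song))
songs_with_term <- ave(rep(1, nrow(counts)), counts$term, FUN = sum)
counts$idf <- log(n_songs / songs_with_term)

counts$tf_idf <- counts$tf * counts$idf
counts
```

Because “love” appears in every toy song, its IDF is `log(3 / 3) = 0` and its TF-IDF is zero everywhere, while words that appear in only one song keep a positive weight.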
# tf_idf
doc_tf_idf <- doc_td %>%
bind_tf_idf(term, document, count) %>%
arrange(desc(tf_idf))
doc_tf_idf %>%
filter(!(term %in% c("can","said"))) %>%
arrange(desc(tf_idf)) %>%
group_by(Song_title2) %>%
slice(1:5) %>%
ggplot(aes(x = reorder(term, desc(-tf_idf)), y = tf_idf)) +
geom_bar(stat = "identity", fill = "#ffcc00") + coord_flip() +
facet_wrap(~ Song_title2, scales="free_y") +
# apply the complete theme first so it does not reset the tweaks below
theme_light() +
theme(axis.title = element_blank(),
axis.text.x = element_blank(),
axis.ticks = element_blank()) +
labs(x = '', y = '', title = "Most Important & Distinct Words Per Song") +
theme(strip.text = element_text(size=8, color = 'black'),
strip.background = element_blank())

I next looked into how many words were unique in each song. “Freedom” uses the largest number of unique words, while songs like “Forward” and “Sand Castles” are shorter and more repetitive. To make this analysis better next time, I would look at the number of unique words as a share of all the words in the song, so the measure is proportionate to the length of the song.
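One possible way to make that length adjustment is to divide the number of distinct terms by the total word count per song. The sketch below does this on a made-up table shaped like `doc_td` (one row per song and term, with a count); the song names and numbers are hypothetical, and this was not part of the original analysis.

```r
# Made-up counts in the same shape as the tidy term table: song, term, count
toy_counts <- data.frame(
  song  = c("short_song", "short_song", "long_song", "long_song", "long_song"),
  term  = c("forward", "go", "freedom", "run", "cut"),
  count = c(6, 2, 10, 5, 5)
)

# number of distinct terms per song
unique_terms <- sapply(split(toy_counts$term, toy_counts$song),
                       function(x) length(unique(x)))

# total number of words per song
total_words <- sapply(split(toy_counts$count, toy_counts$song), sum)

# share of unique words relative to song length
unique_share <- unique_terms / total_words
unique_share
```

Here the longer toy song has more distinct terms (3 vs. 2) but a smaller share of unique words, which is the kind of reversal the raw counts can hide.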
# need to reorder
doc_td %>%
group_by(Song_title) %>%
mutate(unq_wrds = length(term)) %>%
select(Song_title, unq_wrds) %>%
unique() %>%
ggplot(aes(x = reorder(Song_title, unq_wrds), y = unq_wrds)) +
geom_segment(aes(x = Song_title, y = 0, xend = Song_title, yend = unq_wrds),
color = "#ffcc00", size = .5) +
geom_point(size = 7, color = "#ffcc00") +
geom_text(aes(label = unq_wrds), color = "black", size = 3, check_overlap = TRUE) +
labs(x = "", y = "", title = "Number of Unique Words Per Song") +
theme_light() +
coord_flip() +
theme(legend.position="none") +
theme(
axis.text.x = element_blank(),
axis.ticks = element_blank())

I was interested in looking into which songs talked more about men vs. women. To do so, I created a list of words referring to females (e.g. “her”, “she”, “girl”) and a list of words referring to males (e.g. “he”, “daddy”) and plotted the number of times each of these words appeared in each song.
We can see that 6 Inch talks primarily about women; as we recall, 6 Inch is that epic song about female badass-ery, so no surprises here. Daddy Lessons, also unsurprisingly, is mainly about men. Out of all eleven songs, these six were the only ones even referring to females or males, so overall, Beyoncé’s lyrics don’t tend to be overbearingly gendered.
# words set up
pronouns <- c("i", "we", "us", "they", "them", "he", "she", "shes", "you", "him", "her", "hers", "my", "mine")
# all entries are lowercase because the lyrics are lowercased before matching
fem_wrds <- c("her", "hers", "she", "shes", "woman", "women", "girl", "girls", "mother", "i", "daughter", "daughters", "sister", "sisters", "momma", "becky", "wife", "herself")
male_wrds <- c("he", "his", "hes", "man", "men", "mans", "boy", "boys", "daddy", "son", "sons", "brother", "brothers", "father")
together_wrds <- c("we", "us", "together")
you_wrds <- c("you", "your", "youre")
i_words <- c("i", "im", "me", "my")
# gender analysis
shehe_song <- doc_td %>%
filter((term %in% fem_wrds|term %in% male_wrds)) %>%
arrange(desc(count)) %>%
group_by(Song_title) %>%
select(Song_title, term, count)
shehe_song$male <- ifelse(shehe_song$term %in% male_wrds, -(shehe_song$count), 0)
shehe_song$female <- ifelse(shehe_song$term %in% fem_wrds, shehe_song$count, 0)
shehe_song$gender <- shehe_song$female + shehe_song$male
shehe_song <- shehe_song %>%
dplyr::group_by(Song_title) %>%
dplyr::mutate(gender2 = sum(gender))
shehe_song <- shehe_song %>%
dplyr::group_by(Song_title) %>%
dplyr::mutate(female2 = sum(female), male2 = sum(male))
shehe_song$gendertot <- ifelse(shehe_song$gender2 > 0, "female", "male")
shehe_song$gendertot <- factor(shehe_song$gendertot, levels = c("female", "male"))
# visualization
ggplot(shehe_song, aes(x = Song_title, y = gender2)) +
geom_segment(data = filter(shehe_song, male2 < 0),
aes(x = Song_title, y = 0,
xend = Song_title, yend = male2, color = "Words about Males"),
inherit.aes = FALSE,
arrow = arrow(length = unit(0.1, "inches"), ends = "last")) +
geom_segment(data = filter(shehe_song, female2 > 1),
aes(x = Song_title, y = 0, xend = Song_title,
yend = female2, color = "Words about Females"),
inherit.aes = FALSE,
arrow = arrow(length = unit(0.1, "inches"), ends = "last")) +
scale_color_manual(values=c("#ffcc00","#b3de69"), name = "") +
coord_flip() +
theme_light() +
labs(x = '', y = '', title = 'He/She in Lemonade Songs',
caption = "(number of female or male pronouns used in each song) ") +
theme(legend.position="top") +
theme(axis.ticks = element_blank())

I then used bigrams (two-word segments of the lyrics) to understand which words usually came after feminine and masculine pronouns. Because the words “he” and “she” were rarely used in the album, I extended this analysis to look at words that came after any of the feminine or masculine words defined above.
The men in Beyoncé’s journey “said” and “told” things and… Beyoncé probably said “boy, bye” to them/him a few times. The women in Lemonade “worked” and “grind-ed” and “ain’t” sorry. They “loved” and “prayed” and generally felt and did more than the men.
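For readers unfamiliar with bigrams, the base-R sketch below shows the idea on a single made-up line (not an actual lyric): each word is paired with the word that follows it, and the pairs are then filtered by their first word.

```r
# A made-up line of text, already lowercased
line <- "she works hard and he said sorry and she prays"
tokens <- strsplit(line, " ")[[1]]

# pair every word with the word that follows it
toy_bigrams <- data.frame(
  word1 = head(tokens, -1),
  word2 = tail(tokens, -1),
  stringsAsFactors = FALSE
)

# words that directly follow "she" and "he"
after_she <- toy_bigrams$word2[toy_bigrams$word1 == "she"]
after_he  <- toy_bigrams$word2[toy_bigrams$word1 == "he"]
after_she
after_he
```

The analysis itself builds these pairs with `unnest_tokens(..., token = "ngrams", n = 2)` and then applies the same kind of filter on `word1`.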
# she/he bigrams
bigrams <- txts.df %>%
unnest_tokens(bigram, lyrics, token = "ngrams", n = 2) %>%
separate(bigram, c("word1", "word2"), sep = " ")
she_bigram <- bigrams %>%
filter(word1 %in% fem_wrds) %>%
count(word2, sort = TRUE) %>%
slice(1:10)
she_bigram$type <- "Words After 'She'"
colnames(she_bigram)[1] <- "word"
he_bigram <- bigrams %>%
filter(word1 %in% male_wrds) %>%
count(word2, sort = TRUE) %>%
slice(1:10)
he_bigram$type <- "Words After 'He'"
colnames(he_bigram)[1] <- "word"
bigrams_top10 <- rbind(she_bigram, he_bigram)
ggplot(data = bigrams_top10, aes(x = reorder(word, desc(-n)), y = n)) +
geom_bar(aes(fill = type), stat = "identity") + coord_flip() +
# apply the complete theme first so it does not reset the tweaks below
theme_minimal(base_size = 11) +
theme(axis.title = element_blank(),
axis.text.x = element_blank(),
axis.ticks = element_blank()) +
scale_fill_manual(values=c("#b3de69","#ffff80")) +
labs(x = '', y = '', title = "") +
facet_wrap(~type, scales = "free") +
theme(legend.position = "none")

The idea below was taken from an R-bloggers article on the “Weinstein Effect,” but I adapted it for the Lemonade analysis. To understand how strongly these words are tied to male or female pronouns, I transform the bigram counts into a log ratio of she/he. The sign of the log ratio indicates whether the second word appears more after “she” (positive) or “he” (negative), and the absolute value of the log ratio indicates the magnitude. Below, we can see which words appear after male pronouns and female pronouns relative to the other gendered word.
Several trends emerge: first, we see that “he” (likely mainly Jay-Z) says, tells, and swears (likely as in “promises”) a lot. Does he stick to his word? There’s also a fair amount of violence associated with the male pronoun. When Beyoncé talks about girls/women, they emote and DO more (with words like “love”, “pray”, “worked”, “grinds”).
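The log-ratio arithmetic can be checked by hand on a few made-up counts. The sketch below applies the same add-one smoothing and `log2(she / he)` steps as the code that follows; the words and counts are hypothetical.

```r
# Made-up counts of how often each word follows "he" and "she"
toy <- data.frame(
  word2 = c("said", "loved", "told"),
  he    = c(3, 0, 1),
  she   = c(0, 4, 1),
  stringsAsFactors = FALSE
)

# add 1 to every count so a word never seen after one pronoun does not divide by zero,
# then convert to a share of all (smoothed) counts for that pronoun
toy$he_share  <- (toy$he + 1) / sum(toy$he + 1)
toy$she_share <- (toy$she + 1) / sum(toy$she + 1)

# positive: more common after "she"; negative: more common after "he"
toy$log_ratio <- log2(toy$she_share / toy$he_share)
toy
```

Because the ratio is on a base-2 log scale, a value of 1 means the word is twice as common after “she” and a value of -1 means twice as common after “he,” which is why the plot’s axis is labelled “2X,” “4X,” and so on.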
# change words so any fem words become "she" and any male words become "he"
bigrams$word1 <- ifelse(bigrams$word1 %in% male_wrds, "he", bigrams$word1)
bigrams$word1 <- ifelse(bigrams$word1 %in% fem_wrds, "she", bigrams$word1)
# code borrowed from: https://www.r-bloggers.com/a-tidytext-analysis-of-the-weinstein-effect/
he_she_bigram <- bigrams %>%
filter(word1 %in% c("he", "she"))
he_she_bigram <- he_she_bigram %>%
count(word1, word2) %>%
spread(word1, n, fill = 0) %>%
mutate(total = he + she,
he = (he + 1) / sum(he + 1),
she = (she + 1) / sum(she + 1),
log.ratio = log2(she / he),
abs.ratio = abs(log.ratio)) %>%
arrange(desc(log.ratio))
he_she_bigram %>%
group_by(direction = ifelse(log.ratio > 0, 'More female', "More male")) %>%
top_n(10, abs.ratio) %>%
ungroup() %>%
mutate(word2 = reorder(word2, log.ratio)) %>%
ggplot(aes(word2, log.ratio, fill = direction)) +
geom_col() +
coord_flip() +
labs(x = "",
y = 'Relative appearance after feminine pronouns compared to male pronouns',
fill = "",
title = "Lemonade",
subtitle = "The Most Gendered Words After Female/Male Pronouns") +
scale_fill_manual(values=c("#ffff80", "#b3de69")) +
scale_y_continuous(labels = c("8X", "6X", "4X", "2X", "Same", "2X", "4X", "6X", "8X"),
breaks = seq(-4, 4)) +
guides(fill = guide_legend(reverse = TRUE)) +
expand_limits(y = c(-4, 4)) +
theme_minimal()

Lemonade is EMOTIONAL and raw, no doubt, but do the titles of each chapter correctly label the emotion at hand? To better understand the true emotional range of Lemonade, I joined the “AFINN” lexicon to the album lyrics; it rates each word on a scale from -5 (most negative) to 5 (most positive). I then created a cumulative counter that added each word’s score to the running total of the words before it. In the end, we can see the emotional (positive vs. negative) flow of the album in its entirety.
Intuition and Denial start us off strong, then the album takes a negative turn when we get to Anger and Apathy. Things remain fairly negative throughout Emptiness, Accountability and Forgiveness, then Resurrection, Hope and ultimately Redemption bring us way back up and let us end on a high note. I would guess that the spurts of positive words within Emptiness and Accountability are likely words like “like” or “love” but are actually being talked about in a more sad, negative way.
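The cumulative counter is simply a running total of word scores. As a small sketch, the example below uses `cumsum()` on a handful of made-up words and scores (the values are illustrative, not looked up in AFINN).

```r
# Made-up words with illustrative sentiment scores, in lyric order
toy_words <- data.frame(
  word  = c("love", "hurt", "lie", "hope", "win"),
  score = c(3, -2, -1, 2, 4),
  stringsAsFactors = FALSE
)

# running total: each value is the sum of all scores up to and including that word
toy_words$tot <- cumsum(toy_words$score)
toy_words
```

The running total climbs with each positive word and dips with each negative one, so a stretch of the album reads as “negative” in the plot when the line trends downward, regardless of how high it started.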
# get sentiments lexicon
afinn <- get_sentiments("afinn")
words <- words %>%
inner_join(afinn)
# loop over rows to add up score
x = 0
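# iterate over however many scored words there are rather than a hard-coded count
# (note: some newer releases of the AFINN lexicon name the score column `value`)
for (i in seq_len(nrow(words))) {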
words$tot[i] <- words$score[i] + x
x = words$tot[i]
}
# reset index
words <- words %>%
mutate(row = row_number())
# create order to chapters
words$chapter_txt <- ifelse(words$chapter == "Intuition", "1. Intuition", words$chapter)
words$chapter_txt <- ifelse(words$chapter == "Denial", "2. Denial", words$chapter_txt)
words$chapter_txt <- ifelse(words$chapter == "Anger", "3. Anger", words$chapter_txt)
words$chapter_txt <- ifelse(words$chapter == "Apathy", "4. Apathy", words$chapter_txt)
words$chapter_txt <- ifelse(words$chapter == "Emptiness", "5. Emptiness", words$chapter_txt)
words$chapter_txt <- ifelse(words$chapter == "Accountability", "6. Accountability", words$chapter_txt)
words$chapter_txt <- ifelse(words$chapter == "Reformation", "7. Reformation", words$chapter_txt)
words$chapter_txt <- ifelse(words$chapter == "Forgiveness", "8. Forgiveness", words$chapter_txt)
words$chapter_txt <- ifelse(words$chapter == "Ressurection", "9. Ressurection", words$chapter_txt)
words$chapter_txt <- ifelse(words$chapter == "Hope", "10. Hope", words$chapter_txt)
words$chapter_txt <- ifelse(words$chapter == "Redemption", "11. Redemption", words$chapter_txt)
words$chapter_txt <- factor(words$chapter_txt, levels = c("1. Intuition", "2. Denial", "3. Anger", "4. Apathy", "5. Emptiness", "6. Accountability", "7. Reformation", "8. Forgiveness", "9. Ressurection", "10. Hope", "11. Redemption"))
words <- filter(words, !is.na(words$chapter))
words$chapter <- factor(words$chapter, levels = c("Intuition", "Denial", "Anger", "Apathy", "Emptiness", "Accountability", "Reformation", "Forgiveness", "Ressurection", "Hope", "Redemption"))
# over time visualization
ggplot(data = words, aes(x = row, y = tot)) +
geom_area(aes(fill = chapter_txt), alpha = 0.9, color = "white") +
scale_fill_manual(values=c("#ffff80","#ffff33", "#FFA500","#FF8C00", "#FF6126", "#FF4500", "#FF6347", "#FF8C00", "#ffca50", "#ffdb66", "#ff8172"), name = "Chapter") +
labs(x = '', y = '', title = 'Lemonade: Emotional Range of Chapters Over Time\n', subtitle = '') +
ggplot2::annotate(geom = "text", x = 5, y = 100,
label = "Positive Words", size = 3, hjust = -0.1, vjust = -.1) +
ggplot2::annotate("text", x = 5, y = 80,
label = "Negative Words",
size = 3, hjust = -0.1, vjust = -.1) +
ggplot2::annotate("text", x = 200, y = 150,
label = "Album Chapters",
size = 3, hjust = -0.1, vjust = -.1) +
geom_segment(aes(x = 0, xend = 0 , y = 100, yend = 130),
arrow = arrow(length = unit(0.2,"cm")), color = "#99ff33") +
geom_segment(aes(x = 0, xend = 0 , y = 80, yend = 50),
arrow = arrow(length = unit(0.2,"cm")), color = "#F8766D") +
geom_segment(aes(x = 275, xend = 315 , y = 152, yend = 152),
arrow = arrow(length = unit(0.2,"cm")), color = "gray") +
theme_void() +
theme(axis.text = element_blank(),
axis.ticks = element_blank(),
legend.position = "top")

Lastly, I made the visualization interactive so that the user can hover over the plot and see the chapter title, song title and word that contributes to the positive or negative accumulation over time.
## An attempt at interactivity
library(plotly)
colnames(words)[colnames(words) == "chapter_txt"] <- "Chapter"
emo <- ggplot(data = words, aes(x = row, y = tot)) +
geom_area(aes(fill = Chapter, label = word, text = song_title), alpha = 0.9, color = "white") +
scale_fill_manual(values=c("#ffff80","#ffff33", "#FFA500","#FF8C00", "#FF6126", "#FF4500", "#FF6347", "#FF8C00", "#ffca50", "#ffdb66", "#ff8172"), name = "Chapter") +
labs(x = '', y = '', title = 'Lemonade: Emotional Range of Chapters Over Time\n', subtitle = '') +
theme_void() +
theme(text=element_text(size=16,
family="Serif")) +
theme(axis.text = element_blank(),
axis.ticks = element_blank(),
legend.position = "none")
# emo <- emo %>%
# add_fun(function(emo) {
# emo %>%
# add_segments(x = 0, xend = 0, y = 100, yend = 175) %>%
# add_annotations("Minimum uncertainty")
# })
ggplotly(p = emo, tooltip = c("Chapter", "song_title", "word"))

Finally, because the NRC lexicon includes an “anger” category, I tested it against the chapter “Anger” and built a word cloud. The cloud below shows the words from the chapter “Anger” that the lexicon tags as “anger” or “negative.”
nrc_anger <- get_sentiments("nrc") %>%
filter(sentiment == "anger"| sentiment == "negative")
doc_td$word <- doc_td$term
anger <- doc_td %>%
filter(Chapter == "Anger") %>%
inner_join(nrc_anger) %>%
count(word, sort = TRUE)
wordcloud(words = anger$word, freq = anger$n, max.words = 20,
random.order = FALSE, rot.per = 0.35, colors = brewer.pal(3, "Dark2")) # brewer.pal() needs n >= 3